This is Assignment 14 - Morinne Kearns

Requirement 3

Exploratory Data Analysis of Cancer Rates vs Water Quality

Overview: Current Cancer Rates vs Past Water Violations

The goal of this project is to compare current cancer rates from 2014-2018 with the number and amount of TRI pollutant exceedances from 2007-2011 by county, and to determine whether there is a correlation between the two. Twenty random counties will be chosen from those available in the National Institutes of Health (NIH) National Cancer Institute (NCI) State Cancer Profiles. The water violation data will then be obtained for each county for comparison.

Part 1: Cancer Rates

The current cancer incidence rates were obtained through a search of all types of cancer from 2014-2018, including all races, both sexes, and all ages. Results were obtained for the United States and organized by county. The table is sorted with the highest rates at the top and the lowest rates at the bottom.

The Incidence Rate is cases per 100,000 population per year and is age-adjusted to the 2000 US standard population.

Source: Cancer Incidence Rate Report for United States by County
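As a quick sanity check on the units, the crude (non-age-adjusted) rate is simply cases divided by population, scaled to 100,000. The numbers below are hypothetical, and the published rates are additionally age-adjusted to the 2000 US standard population:

```python
# Crude incidence rate per 100,000 population per year.
# Values are made up for illustration; real rates are also age-adjusted.
cases_per_year = 450      # hypothetical annual case count for one county
population = 95_000       # hypothetical county population

crude_rate = cases_per_year / population * 100_000
print(round(crude_rate, 1))  # cases per 100,000 per year
```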

Check data types:

1A: Clean up data using knowledge of Pandas

1B: Exploration of Cancer Incidence Rates

Comparison between states

Since we need numeric types for the numeric data, let's start by checking our current data types:

The Average Annual Count needs to be converted to a numeric type first using the to_numeric() method.
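A minimal sketch of the dtype check and conversion, using a hypothetical slice of the county table (the sample strings, including the suppressed "3 or fewer" entry, are assumptions about how the raw column looks):

```python
import pandas as pd

# Hypothetical slice of the NCI county table; the count column arrives as
# strings, so pandas loads it as object rather than a numeric dtype.
df = pd.DataFrame({
    "County": ["Autauga, AL", "Baldwin, AL", "Barbour, AL"],
    "Average Annual Count": ["291", "1,226", "3 or fewer"],
})
print(df.dtypes)  # Average Annual Count shows as object

# Strip thousands separators, then coerce; unparseable entries become NaN.
counts = df["Average Annual Count"].str.replace(",", "", regex=False)
df["Average Annual Count"] = pd.to_numeric(counts, errors="coerce")
print(df["Average Annual Count"])
```

With `errors="coerce"`, text values that cannot be parsed turn into NaN instead of raising, which keeps the rest of the column usable for aggregation.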

The data is arranged by county, but we will be analyzing it at the state level, so we need to aggregate the counts by state.
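The aggregation can be sketched with groupby on a hypothetical table (the county-label format and the regex used to pull out the state abbreviation are assumptions):

```python
import pandas as pd

# Hypothetical county-level counts; the state is taken from the county label.
df = pd.DataFrame({
    "County": ["Autauga, AL", "Baldwin, AL", "Maricopa, AZ", "Pima, AZ"],
    "Average Annual Count": [291.0, 1226.0, 18000.0, 5200.0],
})

# Extract the two-letter state abbreviation, then sum counts per state.
df["State"] = df["County"].str.extract(r", (\w\w)$", expand=False)
state_counts = df.groupby("State")["Average Annual Count"].sum()
print(state_counts)
```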

Histogram of State Aggregate Average Annual Count

Unfortunately, this only shows us total counts, without taking into consideration the population of each state. Right now, it looks like California, Florida, New York, and Texas have very high cancer counts. Let's see if that holds true after we account for population.
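A chart like the one described can be sketched with Matplotlib on hypothetical state aggregates (the figures below are placeholders, not the real totals):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical state aggregates from the groupby step.
state_counts = pd.Series(
    {"CA": 170000.0, "FL": 120000.0, "NY": 105000.0, "TX": 115000.0, "WY": 2700.0}
)

fig, ax = plt.subplots(figsize=(8, 4))
state_counts.sort_values(ascending=False).plot.bar(ax=ax)
ax.set_ylabel("Average Annual Count (total)")
ax.set_title("State Aggregate Average Annual Cancer Count")
fig.tight_layout()
fig.savefig("state_counts.png")
```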

First, we need to import population data. The 2022 state populations were obtained from World Population Review.

Add the cancer rates to the population dataframe:
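The merge and the population adjustment can be sketched as follows; the population figures here are placeholders, not the World Population Review numbers:

```python
import pandas as pd

# Hypothetical state aggregates and 2022 populations (placeholder values).
counts = pd.DataFrame({"State": ["CA", "TX", "ME"],
                       "Average Annual Count": [170000.0, 115000.0, 8500.0]})
pop = pd.DataFrame({"State": ["CA", "TX", "ME"],
                    "Population": [39_000_000, 30_000_000, 1_400_000]})

# Attach the counts to the population dataframe, then normalize per 100,000.
merged = pop.merge(counts, on="State", how="left")
merged["Rate per 100k"] = merged["Average Annual Count"] / merged["Population"] * 100_000
print(merged)
```

Normalizing to a per-100,000 rate is what removes the population-size effect that made California, Florida, New York, and Texas dominate the raw counts.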

Now that the cancer counts have been adjusted for population, we have better data with the large variability removed. The states with the highest cancer rates are now Maine and West Virginia. Cancer rates are lowest in Nevada and Utah. Kansas and Minnesota both have 0, which is most likely due to a lack of data rather than low cancer rates.

This bar graph is better than the previous one, but the data is still difficult to see and understand. It will be easier to view this data on a heatmap of the US.

Part 2: Water Violations

The next step will be to obtain the water pollutant violation data for 20 randomly chosen counties in the US. The water violation data was obtained from the US Environmental Protection Agency through the Enforcement and Compliance History Online tool. The TRI and DMR Comparison Dashboard was used to find the total Toxic Release Inventory (TRI) and Discharge Monitoring Report (DMR) pollutant load in pounds/year. The sum of the total toxic-weighted pounds equivalent from both reports was taken for 2010, a year with more consistent records that is also far enough in the past for the effects of carcinogens to begin presenting themselves in the population.

The total weighted pounds equivalent is the mass of a pollutant or chemical discharged that accounts for its relative toxicity.

Example of Search Data:


2A: Find 20 random counties

The cancer rate data contains 3141 counties, so 20 random counties will be chosen from indices 1 through 3141. A random seed will be set so the same counties can be tracked throughout the analysis.
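The sampling step can be sketched with a stand-in table of 3141 rows; the seed value is arbitrary, and fixing it makes the draw reproducible:

```python
import pandas as pd

# Hypothetical stand-in for the 3141-county cancer table.
counties = pd.DataFrame({"County": [f"County {i}" for i in range(1, 3142)]})

SEED = 42  # arbitrary; any fixed seed gives a repeatable sample
sample = counties.sample(n=20, random_state=SEED)
print(sample.head())
```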

Merge County and Cancer Rate with corresponding index

Check to see if correct data was aligned

Load water violations data

Add water violations data to counties dataframe that has cancer rates
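The steps above can be sketched as an index-aligned join on hypothetical county names and values (the loads and rates are placeholders):

```python
import pandas as pd

# Hypothetical sampled counties with cancer rates...
cancer = pd.DataFrame(
    {"Rate per 100k": [512.3, 430.8, 467.1]},
    index=["Tuscaloosa, AL", "Knox, TN", "Karnes, TX"],
)
# ...and hypothetical TRI/DMR loads for the same counties (pounds/year, 2010).
water = pd.DataFrame(
    {"Total Pounds Released": [9.8e6, 1.2e5, 7.5e6]},
    index=["Tuscaloosa, AL", "Knox, TN", "Karnes, TX"],
)

# Joining on a shared index avoids the NaN columns a misaligned merge produces.
combined = cancer.join(water)
print(combined)
```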

2B: Analyze Water Violations Data

Detect Outliers for removal using boxplots
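A boxplot flags any point beyond 1.5×IQR past the quartiles as an outlier dot; the same rule can be applied numerically (the loads below are made up, with one deliberately extreme value):

```python
import pandas as pd

# Hypothetical pollutant loads; the last value is far outside the rest.
loads = pd.Series([1.1e5, 9.0e4, 1.3e5, 1.2e5, 1.0e5, 9.8e6],
                  name="Total Pounds Released")

q1, q3 = loads.quantile([0.25, 0.75])
iqr = q3 - q1
# A boxplot draws anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] as an outlier.
outliers = loads[(loads < q1 - 1.5 * iqr) | (loads > q3 + 1.5 * iqr)]
print(outliers)
```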

Remove Outliers from Data

There are 3 obvious outliers in each of the boxplots, so we will remove those counties: Tuscaloosa, Mobile, and Karnes.
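Dropping the flagged counties by label can be sketched as follows (the non-outlier county and all values are placeholders; the three outlier names come from the boxplots above):

```python
import pandas as pd

# Hypothetical combined table; the index labels are county names.
combined = pd.DataFrame(
    {"Rate per 100k": [512.3, 430.8, 467.1, 455.0],
     "Total Pounds Released": [9.8e6, 1.2e5, 7.5e6, 2.1e5]},
    index=["Tuscaloosa, AL", "Knox, TN", "Karnes, TX", "Mobile, AL"],
)

# Remove the three counties the boxplots flagged as outliers.
outlier_counties = ["Tuscaloosa, AL", "Mobile, AL", "Karnes, TX"]
trimmed = combined.drop(index=outlier_counties)
print(trimmed)
```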

Rerun boxplots

There still appears to be an outlier in the Total Pounds Released Data, but we're going to move ahead to scatterplots and correlations.

Scatterplots of Data

It is possible that the counties with no cancer rates simply did not submit data, so we will remove those data points.

Remove cancer rates of zero
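Filtering out the zero rates is a one-line boolean mask; the counties and values below are hypothetical:

```python
import pandas as pd

# Hypothetical table where some counties reported no cancer data (rate of 0).
combined = pd.DataFrame({
    "County": ["Knox, TN", "Ellis, KS", "Rice, MN", "Lake, OH"],
    "Rate per 100k": [430.8, 0.0, 0.0, 471.5],
    "Total Pounds Released": [1.2e5, 4.0e4, 6.1e4, 2.3e5],
})

# A zero rate almost certainly means missing data, not a cancer-free county.
nonzero = combined[combined["Rate per 100k"] > 0]
print(nonzero)
```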

Rerun scatterplots

Calculate and plot correlations
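The correlation step can be sketched with DataFrame.corr() on hypothetical cleaned data; a coefficient near 0 indicates no linear relationship:

```python
import pandas as pd

# Hypothetical cleaned data after outlier and zero-rate removal.
data = pd.DataFrame({
    "Rate per 100k": [430.8, 471.5, 455.0, 490.2, 442.6],
    "Total Pounds Released": [1.2e5, 2.3e5, 2.1e5, 9.0e4, 1.8e5],
})

# Pearson correlation between pollutant load and cancer incidence.
corr = data.corr()
print(corr.loc["Rate per 100k", "Total Pounds Released"])
```

The resulting matrix is what a Seaborn heatmap would visualize, with the off-diagonal entry being the coefficient of interest.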

Summary

It's obvious from the correlation data that, at least with the current data, there is no correlation between the pollutant loads released into water systems and the incidence of cancer later in life. There are several reasons why the data may not be showing a correlation. One possibility is invalid data: values may be missing or incorrect, and it is unclear how reliable the sources are. In addition, all pollutants were included, so extraneous information may be skewing the results. Last, many factors beyond water contaminants can lead to cancer, and it is impossible to remove those possibilities from the data.

An interesting study would look at each particular chemical and see if there is any correlation between it and cancers that are linked to it.

Requirement 4

Grade for Assignment 14

I would assign a grade of A to this project. I spent a significant amount of time working on it (approximately 3 full days of work). In addition, I challenged myself by finding my own data sources and combining them into an analysis. I also challenged myself by learning how to create a choropleth map of cancer rates. Last, I have included all the required elements, including NumPy, Pandas, Seaborn, and Matplotlib, used in various ways.

Requirement 5

Experiences with Assignment 14

I found this assignment to be the most challenging of all the assignments. So much interesting data exists, and I spent a large amount of time considering various datasets. It took physical effort to stop myself and choose a project. But the most challenging aspects involved using my own data and discovering issues that can arise when creating your own projects from scratch. For example, I discovered problems due to datatypes that I hadn't anticipated. When I tried to run the sum aggregation, it didn't work. It took a lot of Google-fu to realize that I was trying to aggregate an object type and needed to convert it to a numeric type. Another example involved trying to merge data that had different indexes. I kept getting NaN values after the merge, and it took another long search to find that the answer was to align the indexes.

Even though it was incredibly challenging, I feel I have learned an incredible amount this week. The most important thing I learned is that I still have a lot more learning to do when it comes to everything we covered this semester.